Renewable energy sources play an increasingly important role in the global energy mix as efforts to reduce the environmental impact of energy production intensify.
Among the renewable alternatives, wind energy is one of the most developed technologies worldwide. The U.S. Department of Energy has published a guide to achieving operational efficiency through predictive maintenance practices.
Predictive maintenance uses sensor data and analytical methods to measure degradation and predict future component capability. The idea behind predictive maintenance is that failure patterns are predictable: if component failure can be predicted accurately and the component is replaced before it fails, operation and maintenance costs drop substantially.
Sensors fitted across the machines involved in energy generation collect data on environmental factors (temperature, humidity, wind speed, etc.) and on various parts of the wind turbine (gearbox, tower, blades, brake, etc.).
“ReneWind” is a company that applies machine learning to improve the machinery and processes involved in wind energy production, and it has collected sensor data on generator failures in wind turbines. Because the data collected through sensors is confidential (the type of data collected varies by company), the company has shared a ciphered version. The data has 40 predictors, with 20,000 observations in the training set and 5,000 in the test set.
The objective is to build and tune various classification models and find the one that best identifies failures, so that generators can be repaired before failing/breaking and the overall maintenance cost is reduced. The nature of predictions made by the classification model translates as follows:

- True positive: a failure is correctly predicted, so the generator is repaired in time (repair cost).
- False negative: a real failure is missed, so the generator breaks and must be replaced (replacement cost).
- False positive: a failure is predicted but does not occur, so the generator is only inspected (inspection cost).

It is given that the cost of repairing a generator is much less than the cost of replacing it, and the cost of inspection is less than the cost of repair.
“1” in the target variable represents “failure” and “0” represents “no failure”.
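To make the cost argument concrete, here is a minimal sketch with purely hypothetical unit costs; the exercise only gives their ordering (inspection < repair < replacement), so the numbers and the `maintenance_cost` helper below are illustrative assumptions, not part of the dataset:

```python
# Hypothetical unit costs (assumed for illustration; only their ordering is given):
COST_REPLACE = 40_000  # replacing a generator after a missed failure (false negative)
COST_REPAIR = 15_000   # repairing a generator before it fails (true positive)
COST_INSPECT = 5_000   # inspecting a healthy generator flagged by mistake (false positive)

def maintenance_cost(tp, fp, fn):
    """Total maintenance cost implied by a model's confusion-matrix counts."""
    return tp * COST_REPAIR + fp * COST_INSPECT + fn * COST_REPLACE

# A model that misses fewer failures is cheaper overall,
# even if it raises more false alarms:
high_recall = maintenance_cost(tp=95, fp=50, fn=5)   # 1,875,000
low_recall = maintenance_cost(tp=60, fp=10, fn=40)   # 2,550,000
print(high_recall < low_recall)  # True
```

Under any costs with this ordering, false negatives dominate the bill, which is why recall is the metric to watch.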
#for data manipulation
import numpy as np
import pandas as pd
#for data visualization
import seaborn as sns
import matplotlib.pyplot as plt
#for statistics
import scipy.stats as stats
#for imputing (just in case there are missing values in the data)
from sklearn.impute import SimpleImputer
#for splitting the training set into training and validation sets, and for cross-validation
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
#for oversampling and undersampling
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
#for regression models
from sklearn.linear_model import LogisticRegression
#for building different models
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import GradientBoostingClassifier
#for deriving performance metrics of models
from sklearn import metrics
from sklearn.metrics import (precision_score, recall_score, f1_score, accuracy_score, confusion_matrix, roc_auc_score)
#for tuning hyperparameters of models
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
#for making pipeline
from sklearn.pipeline import Pipeline
#for ignoring warnings
import warnings
warnings.filterwarnings('ignore')
#loading the training set data into a dataframe called df_train
df_train = pd.read_csv('Train.csv')
#loading the testing set data into a dataframe called df_test
df_test = pd.read_csv('Test.csv')
The df_test dataframe is used only for testing the models, so the data analysis focuses on the df_train dataframe. df_test is still viewed for sanity checking and to better understand the distribution of the data.
#viewing the first five rows of df_train
df_train.head()
| V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | V9 | V10 | ... | V32 | V33 | V34 | V35 | V36 | V37 | V38 | V39 | V40 | Target | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -4.464606 | -4.679129 | 3.101546 | 0.506130 | -0.221083 | -2.032511 | -2.910870 | 0.050714 | -1.522351 | 3.761892 | ... | 3.059700 | -1.690440 | 2.846296 | 2.235198 | 6.667486 | 0.443809 | -2.369169 | 2.950578 | -3.480324 | 0 |
| 1 | 3.365912 | 3.653381 | 0.909671 | -1.367528 | 0.332016 | 2.358938 | 0.732600 | -4.332135 | 0.565695 | -0.101080 | ... | -1.795474 | 3.032780 | -2.467514 | 1.894599 | -2.297780 | -1.731048 | 5.908837 | -0.386345 | 0.616242 | 0 |
| 2 | -3.831843 | -5.824444 | 0.634031 | -2.418815 | -1.773827 | 1.016824 | -2.098941 | -3.173204 | -2.081860 | 5.392621 | ... | -0.257101 | 0.803550 | 4.086219 | 2.292138 | 5.360850 | 0.351993 | 2.940021 | 3.839160 | -4.309402 | 0 |
| 3 | 1.618098 | 1.888342 | 7.046143 | -1.147285 | 0.083080 | -1.529780 | 0.207309 | -2.493629 | 0.344926 | 2.118578 | ... | -3.584425 | -2.577474 | 1.363769 | 0.622714 | 5.550100 | -1.526796 | 0.138853 | 3.101430 | -1.277378 | 0 |
| 4 | -0.111440 | 3.872488 | -3.758361 | -2.982897 | 3.792714 | 0.544960 | 0.205433 | 4.848994 | -1.854920 | -6.220023 | ... | 8.265896 | 6.629213 | -10.068689 | 1.222987 | -3.229763 | 1.686909 | -2.163896 | -3.644622 | 6.510338 | 0 |
5 rows × 41 columns
#viewing the last five rows of df_train
df_train.tail()
| V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | V9 | V10 | ... | V32 | V33 | V34 | V35 | V36 | V37 | V38 | V39 | V40 | Target | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 19995 | -2.071318 | -1.088279 | -0.796174 | -3.011720 | -2.287540 | 2.807310 | 0.481428 | 0.105171 | -0.586599 | -2.899398 | ... | -8.273996 | 5.745013 | 0.589014 | -0.649988 | -3.043174 | 2.216461 | 0.608723 | 0.178193 | 2.927755 | 1 |
| 19996 | 2.890264 | 2.483069 | 5.643919 | 0.937053 | -1.380870 | 0.412051 | -1.593386 | -5.762498 | 2.150096 | 0.272302 | ... | -4.159092 | 1.181466 | -0.742412 | 5.368979 | -0.693028 | -1.668971 | 3.659954 | 0.819863 | -1.987265 | 0 |
| 19997 | -3.896979 | -3.942407 | -0.351364 | -2.417462 | 1.107546 | -1.527623 | -3.519882 | 2.054792 | -0.233996 | -0.357687 | ... | 7.112162 | 1.476080 | -3.953710 | 1.855555 | 5.029209 | 2.082588 | -6.409304 | 1.477138 | -0.874148 | 0 |
| 19998 | -3.187322 | -10.051662 | 5.695955 | -4.370053 | -5.354758 | -1.873044 | -3.947210 | 0.679420 | -2.389254 | 5.456756 | ... | 0.402812 | 3.163661 | 3.752095 | 8.529894 | 8.450626 | 0.203958 | -7.129918 | 4.249394 | -6.112267 | 0 |
| 19999 | -2.686903 | 1.961187 | 6.137088 | 2.600133 | 2.657241 | -4.290882 | -2.344267 | 0.974004 | -1.027462 | 0.497421 | ... | 6.620811 | -1.988786 | -1.348901 | 3.951801 | 5.449706 | -0.455411 | -2.202056 | 1.678229 | -1.974413 | 0 |
5 rows × 41 columns
#viewing the first five rows of df_test
df_test.head()
| V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | V9 | V10 | ... | V32 | V33 | V34 | V35 | V36 | V37 | V38 | V39 | V40 | Target | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -0.613489 | -3.819640 | 2.202302 | 1.300420 | -1.184929 | -4.495964 | -1.835817 | 4.722989 | 1.206140 | -0.341909 | ... | 2.291204 | -5.411388 | 0.870073 | 0.574479 | 4.157191 | 1.428093 | -10.511342 | 0.454664 | -1.448363 | 0 |
| 1 | 0.389608 | -0.512341 | 0.527053 | -2.576776 | -1.016766 | 2.235112 | -0.441301 | -4.405744 | -0.332869 | 1.966794 | ... | -2.474936 | 2.493582 | 0.315165 | 2.059288 | 0.683859 | -0.485452 | 5.128350 | 1.720744 | -1.488235 | 0 |
| 2 | -0.874861 | -0.640632 | 4.084202 | -1.590454 | 0.525855 | -1.957592 | -0.695367 | 1.347309 | -1.732348 | 0.466500 | ... | -1.318888 | -2.997464 | 0.459664 | 0.619774 | 5.631504 | 1.323512 | -1.752154 | 1.808302 | 1.675748 | 0 |
| 3 | 0.238384 | 1.458607 | 4.014528 | 2.534478 | 1.196987 | -3.117330 | -0.924035 | 0.269493 | 1.322436 | 0.702345 | ... | 3.517918 | -3.074085 | -0.284220 | 0.954576 | 3.029331 | -1.367198 | -3.412140 | 0.906000 | -2.450889 | 0 |
| 4 | 5.828225 | 2.768260 | -1.234530 | 2.809264 | -1.641648 | -1.406698 | 0.568643 | 0.965043 | 1.918379 | -2.774855 | ... | 1.773841 | -1.501573 | -2.226702 | 4.776830 | -6.559698 | -0.805551 | -0.276007 | -3.858207 | -0.537694 | 0 |
5 rows × 41 columns
#viewing the last five rows of df_test
df_test.tail()
| V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | V9 | V10 | ... | V32 | V33 | V34 | V35 | V36 | V37 | V38 | V39 | V40 | Target | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4995 | -5.120451 | 1.634804 | 1.251259 | 4.035944 | 3.291204 | -2.932230 | -1.328662 | 1.754066 | -2.984586 | 1.248633 | ... | 9.979118 | 0.063438 | 0.217281 | 3.036388 | 2.109323 | -0.557433 | 1.938718 | 0.512674 | -2.694194 | 0 |
| 4996 | -5.172498 | 1.171653 | 1.579105 | 1.219922 | 2.529627 | -0.668648 | -2.618321 | -2.000545 | 0.633791 | -0.578938 | ... | 4.423900 | 2.603811 | -2.152170 | 0.917401 | 2.156586 | 0.466963 | 0.470120 | 2.196756 | -2.376515 | 0 |
| 4997 | -1.114136 | -0.403576 | -1.764875 | -5.879475 | 3.571558 | 3.710802 | -2.482952 | -0.307614 | -0.921945 | -2.999141 | ... | 3.791778 | 7.481506 | -10.061396 | -0.387166 | 1.848509 | 1.818248 | -1.245633 | -1.260876 | 7.474682 | 0 |
| 4998 | -1.703241 | 0.614650 | 6.220503 | -0.104132 | 0.955916 | -3.278706 | -1.633855 | -0.103936 | 1.388152 | -1.065622 | ... | -4.100352 | -5.949325 | 0.550372 | -1.573640 | 6.823936 | 2.139307 | -4.036164 | 3.436051 | 0.579249 | 0 |
| 4999 | -0.603701 | 0.959550 | -0.720995 | 8.229574 | -1.815610 | -2.275547 | -2.574524 | -1.041479 | 4.129645 | -2.731288 | ... | 2.369776 | -1.062408 | 0.790772 | 4.951955 | -7.440825 | -0.069506 | -0.918083 | -2.291154 | -5.362891 | 0 |
5 rows × 41 columns
#viewing number of rows and columns in df_train
df_train.shape
(20000, 41)
#viewing number of rows and columns in df_test
df_test.shape
(5000, 41)
#viewing datatypes and column information for df_train
df_train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20000 entries, 0 to 19999
Data columns (total 41 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   V1      19982 non-null  float64
 1   V2      19982 non-null  float64
 2   V3      20000 non-null  float64
 3   V4      20000 non-null  float64
 4   V5      20000 non-null  float64
 5   V6      20000 non-null  float64
 6   V7      20000 non-null  float64
 7   V8      20000 non-null  float64
 8   V9      20000 non-null  float64
 9   V10     20000 non-null  float64
 10  V11     20000 non-null  float64
 11  V12     20000 non-null  float64
 12  V13     20000 non-null  float64
 13  V14     20000 non-null  float64
 14  V15     20000 non-null  float64
 15  V16     20000 non-null  float64
 16  V17     20000 non-null  float64
 17  V18     20000 non-null  float64
 18  V19     20000 non-null  float64
 19  V20     20000 non-null  float64
 20  V21     20000 non-null  float64
 21  V22     20000 non-null  float64
 22  V23     20000 non-null  float64
 23  V24     20000 non-null  float64
 24  V25     20000 non-null  float64
 25  V26     20000 non-null  float64
 26  V27     20000 non-null  float64
 27  V28     20000 non-null  float64
 28  V29     20000 non-null  float64
 29  V30     20000 non-null  float64
 30  V31     20000 non-null  float64
 31  V32     20000 non-null  float64
 32  V33     20000 non-null  float64
 33  V34     20000 non-null  float64
 34  V35     20000 non-null  float64
 35  V36     20000 non-null  float64
 36  V37     20000 non-null  float64
 37  V38     20000 non-null  float64
 38  V39     20000 non-null  float64
 39  V40     20000 non-null  float64
 40  Target  20000 non-null  int64
dtypes: float64(40), int64(1)
memory usage: 6.3 MB
#viewing datatypes and column information for df_test
df_test.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 41 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   V1      4995 non-null   float64
 1   V2      4994 non-null   float64
 2   V3      5000 non-null   float64
 3   V4      5000 non-null   float64
 4   V5      5000 non-null   float64
 5   V6      5000 non-null   float64
 6   V7      5000 non-null   float64
 7   V8      5000 non-null   float64
 8   V9      5000 non-null   float64
 9   V10     5000 non-null   float64
 10  V11     5000 non-null   float64
 11  V12     5000 non-null   float64
 12  V13     5000 non-null   float64
 13  V14     5000 non-null   float64
 14  V15     5000 non-null   float64
 15  V16     5000 non-null   float64
 16  V17     5000 non-null   float64
 17  V18     5000 non-null   float64
 18  V19     5000 non-null   float64
 19  V20     5000 non-null   float64
 20  V21     5000 non-null   float64
 21  V22     5000 non-null   float64
 22  V23     5000 non-null   float64
 23  V24     5000 non-null   float64
 24  V25     5000 non-null   float64
 25  V26     5000 non-null   float64
 26  V27     5000 non-null   float64
 27  V28     5000 non-null   float64
 28  V29     5000 non-null   float64
 29  V30     5000 non-null   float64
 30  V31     5000 non-null   float64
 31  V32     5000 non-null   float64
 32  V33     5000 non-null   float64
 33  V34     5000 non-null   float64
 34  V35     5000 non-null   float64
 35  V36     5000 non-null   float64
 36  V37     5000 non-null   float64
 37  V38     5000 non-null   float64
 38  V39     5000 non-null   float64
 39  V40     5000 non-null   float64
 40  Target  5000 non-null   int64
dtypes: float64(40), int64(1)
memory usage: 1.6 MB
#viewing a statistical summary for the columns in df_train
df_train.describe()
| V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | V9 | V10 | ... | V32 | V33 | V34 | V35 | V36 | V37 | V38 | V39 | V40 | Target | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 19982.000000 | 19982.000000 | 20000.000000 | 20000.000000 | 20000.000000 | 20000.000000 | 20000.000000 | 20000.000000 | 20000.000000 | 20000.000000 | ... | 20000.000000 | 20000.000000 | 20000.000000 | 20000.000000 | 20000.000000 | 20000.000000 | 20000.000000 | 20000.000000 | 20000.000000 | 20000.000000 |
| mean | -0.271996 | 0.440430 | 2.484699 | -0.083152 | -0.053752 | -0.995443 | -0.879325 | -0.548195 | -0.016808 | -0.012998 | ... | 0.303799 | 0.049825 | -0.462702 | 2.229620 | 1.514809 | 0.011316 | -0.344025 | 0.890653 | -0.875630 | 0.055500 |
| std | 3.441625 | 3.150784 | 3.388963 | 3.431595 | 2.104801 | 2.040970 | 1.761626 | 3.295756 | 2.160568 | 2.193201 | ... | 5.500400 | 3.575285 | 3.183841 | 2.937102 | 3.800860 | 1.788165 | 3.948147 | 1.753054 | 3.012155 | 0.228959 |
| min | -11.876451 | -12.319951 | -10.708139 | -15.082052 | -8.603361 | -10.227147 | -7.949681 | -15.657561 | -8.596313 | -9.853957 | ... | -19.876502 | -16.898353 | -17.985094 | -15.349803 | -14.833178 | -5.478350 | -17.375002 | -6.438880 | -11.023935 | 0.000000 |
| 25% | -2.737146 | -1.640674 | 0.206860 | -2.347660 | -1.535607 | -2.347238 | -2.030926 | -2.642665 | -1.494973 | -1.411212 | ... | -3.420469 | -2.242857 | -2.136984 | 0.336191 | -0.943809 | -1.255819 | -2.987638 | -0.272250 | -2.940193 | 0.000000 |
| 50% | -0.747917 | 0.471536 | 2.255786 | -0.135241 | -0.101952 | -1.000515 | -0.917179 | -0.389085 | -0.067597 | 0.100973 | ... | 0.052073 | -0.066249 | -0.255008 | 2.098633 | 1.566526 | -0.128435 | -0.316849 | 0.919261 | -0.920806 | 0.000000 |
| 75% | 1.840112 | 2.543967 | 4.566165 | 2.130615 | 1.340480 | 0.380330 | 0.223695 | 1.722965 | 1.409203 | 1.477045 | ... | 3.761722 | 2.255134 | 1.436935 | 4.064358 | 3.983939 | 1.175533 | 2.279399 | 2.057540 | 1.119897 | 0.000000 |
| max | 15.493002 | 13.089269 | 17.090919 | 13.236381 | 8.133797 | 6.975847 | 8.006091 | 11.679495 | 8.137580 | 8.108472 | ... | 23.633187 | 16.692486 | 14.358213 | 15.291065 | 19.329576 | 7.467006 | 15.289923 | 7.759877 | 10.654265 | 1.000000 |
8 rows × 41 columns
#viewing a statistical summary for the columns in df_test
df_test.describe()
| V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | V9 | V10 | ... | V32 | V33 | V34 | V35 | V36 | V37 | V38 | V39 | V40 | Target | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 4995.000000 | 4994.000000 | 5000.000000 | 5000.000000 | 5000.000000 | 5000.000000 | 5000.000000 | 5000.000000 | 5000.000000 | 5000.000000 | ... | 5000.000000 | 5000.000000 | 5000.000000 | 5000.000000 | 5000.000000 | 5000.000000 | 5000.000000 | 5000.000000 | 5000.000000 | 5000.000000 |
| mean | -0.277622 | 0.397928 | 2.551787 | -0.048943 | -0.080120 | -1.042138 | -0.907922 | -0.574592 | 0.030121 | 0.018524 | ... | 0.232567 | -0.080115 | -0.392663 | 2.211205 | 1.594845 | 0.022931 | -0.405659 | 0.938800 | -0.932406 | 0.056400 |
| std | 3.466280 | 3.139562 | 3.326607 | 3.413937 | 2.110870 | 2.005444 | 1.769017 | 3.331911 | 2.174139 | 2.145437 | ... | 5.585628 | 3.538624 | 3.166101 | 2.948426 | 3.774970 | 1.785320 | 3.968936 | 1.716502 | 2.978193 | 0.230716 |
| min | -12.381696 | -10.716179 | -9.237940 | -14.682446 | -7.711569 | -8.924196 | -8.124230 | -12.252731 | -6.785495 | -8.170956 | ... | -17.244168 | -14.903781 | -14.699725 | -12.260591 | -12.735567 | -5.079070 | -15.334533 | -5.451050 | -10.076234 | 0.000000 |
| 25% | -2.743691 | -1.649211 | 0.314931 | -2.292694 | -1.615238 | -2.368853 | -2.054259 | -2.642088 | -1.455712 | -1.353320 | ... | -3.556267 | -2.348121 | -2.009604 | 0.321818 | -0.866066 | -1.240526 | -2.984480 | -0.208024 | -2.986587 | 0.000000 |
| 50% | -0.764767 | 0.427369 | 2.260428 | -0.145753 | -0.131890 | -1.048571 | -0.939695 | -0.357943 | -0.079891 | 0.166292 | ... | -0.076694 | -0.159713 | -0.171745 | 2.111750 | 1.702964 | -0.110415 | -0.381162 | 0.959152 | -1.002764 | 0.000000 |
| 75% | 1.831313 | 2.444486 | 4.587000 | 2.166468 | 1.341197 | 0.307555 | 0.212228 | 1.712896 | 1.449548 | 1.511248 | ... | 3.751857 | 2.099160 | 1.465402 | 4.031639 | 4.104409 | 1.237522 | 2.287998 | 2.130769 | 1.079738 | 0.000000 |
| max | 13.504352 | 14.079073 | 15.314503 | 12.140157 | 7.672835 | 5.067685 | 7.616182 | 10.414722 | 8.850720 | 6.598728 | ... | 26.539391 | 13.323517 | 12.146302 | 13.489237 | 17.116122 | 6.809938 | 13.064950 | 7.182237 | 8.698460 | 1.000000 |
8 rows × 41 columns
#viewing duplicate values in df_train
df_train.duplicated().sum()
0
#viewing duplicate values in df_test
df_test.duplicated().sum()
0
There are no duplicate values in df_train and df_test.
#viewing missing values in df_train
df_train.isnull().sum()
V1        18
V2        18
V3         0
V4         0
V5         0
V6         0
V7         0
V8         0
V9         0
V10        0
V11        0
V12        0
V13        0
V14        0
V15        0
V16        0
V17        0
V18        0
V19        0
V20        0
V21        0
V22        0
V23        0
V24        0
V25        0
V26        0
V27        0
V28        0
V29        0
V30        0
V31        0
V32        0
V33        0
V34        0
V35        0
V36        0
V37        0
V38        0
V39        0
V40        0
Target     0
dtype: int64
#viewing missing values in df_test
df_test.isnull().sum()
V1        5
V2        6
V3        0
V4        0
V5        0
V6        0
V7        0
V8        0
V9        0
V10       0
V11       0
V12       0
V13       0
V14       0
V15       0
V16       0
V17       0
V18       0
V19       0
V20       0
V21       0
V22       0
V23       0
V24       0
V25       0
V26       0
V27       0
V28       0
V29       0
V30       0
V31       0
V32       0
V33       0
V34       0
V35       0
V36       0
V37       0
V38       0
V39       0
V40       0
Target    0
dtype: int64
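Only a handful of values are missing (18 each in V1 and V2 of the training set; 5 and 6 in the test set), well under 0.1% of rows. The missing share per column can be checked directly; the toy frame below stands in for df_train, where the notebook call would simply be `df_train.isnull().mean() * 100`:

```python
import numpy as np
import pandas as pd

# Toy stand-in for df_train (the real frame has 20000 rows and 41 columns)
toy = pd.DataFrame({
    'V1': [1.0, np.nan, 3.0, 4.0],
    'V2': [np.nan, 2.0, 3.0, 4.0],
    'V3': [1.0, 2.0, 3.0, 4.0],
})
# percentage of missing values per column
missing_pct = toy.isnull().mean() * 100
print(missing_pct)  # V1 and V2 each 25% missing in the toy frame, V3 complete
```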
# function to plot a boxplot and a histogram along the same scale.
def histogram_boxplot(data, feature, figsize=(12, 7), kde=False, bins=None):
    """
    Boxplot and histogram combined
    data: dataframe
    feature: dataframe column
    figsize: size of figure (default (12, 7))
    kde: whether to show the density curve (default False)
    bins: number of bins for the histogram (default None)
    """
    f2, (ax_box2, ax_hist2) = plt.subplots(
        nrows=2,  # number of rows of the subplot grid = 2
        sharex=True,  # x-axis will be shared among all subplots
        gridspec_kw={'height_ratios': (0.25, 0.75)},
        figsize=figsize,
    )  # creating the 2 subplots
    sns.boxplot(
        data=data, x=feature, ax=ax_box2, showmeans=True, color='violet'
    )  # boxplot; the star marker indicates the mean value of the column
    if bins:
        sns.histplot(data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins)  # histogram with a fixed bin count
    else:
        sns.histplot(data=data, x=feature, kde=kde, ax=ax_hist2)  # histogram with automatic binning
    ax_hist2.axvline(
        data[feature].mean(), color='green', linestyle='--'
    )  # add mean to the histogram
    ax_hist2.axvline(
        data[feature].median(), color='black', linestyle='-'
    )  # add median to the histogram
for feature in df_train.columns: #for each column in df_train
histogram_boxplot(df_train, feature, figsize=(12, 7), kde=False, bins=None) #plot column on x-axis
for feature in df_test.columns: #for each column in df_test
histogram_boxplot(df_test, feature, figsize=(12, 7), kde=False, bins=None) #plot column on x-axis
#making a heatmap using columns from df_train to view correlation
#correlation range is from -1 to 1, labels are displayed and limited to 2 decimal places
plt.subplots(figsize=(20,15)) #adjusting size of heatmap to make sure all variables can be seen
sns.heatmap(df_train.corr(), annot=True, vmin=-1, vmax=1, fmt='.2f')
plt.show(); #displaying heatmap
#making a heatmap using columns from df_test to view correlation
#correlation range is from -1 to 1, labels are displayed and limited to 2 decimal places
plt.subplots(figsize=(20,15)) #adjusting size of heatmap to make sure all variables can be seen
sns.heatmap(df_test.corr(), annot=True, vmin=-1, vmax=1, fmt='.2f')
plt.show(); #displaying heatmap
There are no columns that need to be dropped. Although there are some correlated columns, the correlation is not greater than 0.90 or less than -0.90. Also, the column names are hidden/encoded, so it is unknown which columns are related to each other or which columns belong together. It is best not to drop them.
There are no categorical variables that need to be encoded. All of the columns, except the target variable, are float datatypes. The target column is an integer datatype.
The values of the target variable are already in 0 and 1 format, so there is no need to change them.
In the univariate analysis section, it can be seen that there are outliers. They are authentic data points, so they will not be changed.
Since a separate test set is already provided, there is no need to split the data into train and test sets. The test set only needs to be organized into X_test and y_test, while the training set needs to be split into training and validation sets.
#adding all columns of df_test, except Target, to X_test
X_test = df_test.drop(['Target'], axis=1)
#adding Target column to y_test
y_test = df_test['Target']
#adding all columns of df_train, except Target, to X
X = df_train.drop(['Target'], axis=1)
#adding Target column to Y
Y = df_train['Target']
#dividing data in df_train into train and validation set using a test_size of 0.25
X_train, X_val, y_train, y_val = train_test_split(X, Y, test_size=0.25, random_state=1, stratify=Y)
#checking number of columns in each set
print(X_train.shape, X_val.shape, X_test.shape, y_train.shape, y_val.shape, y_test.shape)
(15000, 40) (5000, 40) (5000, 40) (15000,) (5000,) (5000,)
The data in df_train has been split into training and validation sets. The df_test set has been organized into X_test and y_test. There are 15,000 rows in the training set. In the testing and validation sets, there are 5,000 rows each.
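Because only about 5.5% of the observations are failures, it is worth confirming that the stratified split preserves the class proportions. A minimal sketch on synthetic data (the notebook itself would simply compare `y_train.value_counts(normalize=True)` and `y_val.value_counts(normalize=True)`):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for X and Y; the real target has roughly a 5.55% failure rate
rng = np.random.default_rng(1)
y_demo = pd.Series((rng.random(20000) < 0.0555).astype(int))
X_demo = pd.DataFrame({'V1': rng.normal(size=20000)})

# stratify=y_demo keeps the failure rate (almost) identical in both splits
Xtr, Xv, ytr, yv = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=1, stratify=y_demo
)
print(ytr.mean().round(4), yv.mean().round(4))  # near-identical failure rates
```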
The V1 and V2 columns are both float datatypes. Since V1 and V2 have outliers in both the df_train and df_test sets, it is best to impute the missing values with the median, which is robust to outliers.
To ensure no data leakage, the imputation of missing values is completed after the splitting of data.
#assigning imputer to SimpleImputer and setting imputing strategy to median
imputer = SimpleImputer(strategy='median')
#fitting and using imputer to transform the X_train set
X_train = pd.DataFrame(imputer.fit_transform(X_train), columns=X_train.columns)
#using imputer to transform the X_val set
X_val = pd.DataFrame(imputer.transform(X_val), columns=X_val.columns)
#using imputer to transform the X_test set
X_test = pd.DataFrame(imputer.transform(X_test), columns=X_test.columns)
#viewing missing values in X_train
X_train.isnull().sum()
V1 0 V2 0 V3 0 V4 0 V5 0 V6 0 V7 0 V8 0 V9 0 V10 0 V11 0 V12 0 V13 0 V14 0 V15 0 V16 0 V17 0 V18 0 V19 0 V20 0 V21 0 V22 0 V23 0 V24 0 V25 0 V26 0 V27 0 V28 0 V29 0 V30 0 V31 0 V32 0 V33 0 V34 0 V35 0 V36 0 V37 0 V38 0 V39 0 V40 0 dtype: int64
#viewing missing values in X_val
X_val.isnull().sum()
V1 0 V2 0 V3 0 V4 0 V5 0 V6 0 V7 0 V8 0 V9 0 V10 0 V11 0 V12 0 V13 0 V14 0 V15 0 V16 0 V17 0 V18 0 V19 0 V20 0 V21 0 V22 0 V23 0 V24 0 V25 0 V26 0 V27 0 V28 0 V29 0 V30 0 V31 0 V32 0 V33 0 V34 0 V35 0 V36 0 V37 0 V38 0 V39 0 V40 0 dtype: int64
#viewing missing values in X_test
X_test.isnull().sum()
V1 0 V2 0 V3 0 V4 0 V5 0 V6 0 V7 0 V8 0 V9 0 V10 0 V11 0 V12 0 V13 0 V14 0 V15 0 V16 0 V17 0 V18 0 V19 0 V20 0 V21 0 V22 0 V23 0 V24 0 V25 0 V26 0 V27 0 V28 0 V29 0 V30 0 V31 0 V32 0 V33 0 V34 0 V35 0 V36 0 V37 0 V38 0 V39 0 V40 0 dtype: int64
#viewing missing values in y_train, y_val, and y_test
print(y_train.isnull().sum(),y_val.isnull().sum(),y_test.isnull().sum())
0 0 0
There are no missing values anymore in the training, testing or validation sets.
for feature in X_train.columns: #for each column in X_train
histogram_boxplot(X_train, feature, figsize=(12, 7), kde=False, bins=None) #plot column on x-axis
for feature in X_val.columns: #for each column in X_val
histogram_boxplot(X_val, feature, figsize=(12, 7), kde=False, bins=None) #plot column on x-axis
for feature in X_test.columns: #for each column in X_test
histogram_boxplot(X_test, feature, figsize=(12, 7), kde=False, bins=None) #plot column on x-axis
After imputing missing values, there are no significant changes in the distributions of each variable in the training, validation or testing sets.
#making a heatmap using columns from X_train to view correlation
#correlation range is from -1 to 1, labels are displayed and limited to 2 decimal places
plt.subplots(figsize=(20,15)) #adjusting size of heatmap to make sure all variables can be seen
sns.heatmap(X_train.corr(), annot=True, vmin=-1, vmax=1, fmt='.2f')
plt.show(); #displaying heatmap
#making a heatmap using columns from X_val to view correlation
#correlation range is from -1 to 1, labels are displayed and limited to 2 decimal places
plt.subplots(figsize=(20,15)) #adjusting size of heatmap to make sure all variables can be seen
sns.heatmap(X_val.corr(), annot=True, vmin=-1, vmax=1, fmt='.2f')
plt.show(); #displaying heatmap
#making a heatmap using columns from X_test to view correlation
#correlation range is from -1 to 1, labels are displayed and limited to 2 decimal places
plt.subplots(figsize=(20,15)) #adjusting size of heatmap to make sure all variables can be seen
sns.heatmap(X_test.corr(), annot=True, vmin=-1, vmax=1, fmt='.2f')
plt.show(); #displaying heatmap
After imputing missing values, there are no significant changes in the correlation in the training, validation or testing sets.
The nature of predictions made by the classification model translates directly into maintenance costs: a missed failure (false negative) forces a replacement, the most expensive outcome, while a false alarm (false positive) only incurs an inspection cost.
Which metric to optimize? Recall: maximizing recall minimizes false negatives, and therefore minimizes the chance of incurring replacement costs.
Let's define a function to output different metrics (including recall) on the train and test sets, and a function to show the confusion matrix, so that we do not have to repeat the same code while evaluating models.
# defining a function to compute different metrics to check performance of a classification model built using sklearn
def model_performance_classification_sklearn(model, predictors, target):
"""
Function to compute different metrics to check classification model performance
model: classifier
predictors: independent variables
target: dependent variable
"""
# predicting using the independent variables
pred = model.predict(predictors)
acc = accuracy_score(target, pred) # to compute Accuracy
recall = recall_score(target, pred) # to compute Recall
precision = precision_score(target, pred) # to compute Precision
f1 = f1_score(target, pred) # to compute F1-score
# creating a dataframe of metrics
df_perf = pd.DataFrame(
{
'Accuracy': acc,
'Recall': recall,
'Precision': precision,
'F1': f1
},
index=[0],
)
return df_perf
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)
Building candidate models with the original (imbalanced) data
models = [] # Empty list to store all the models
# Appending models into the list
models.append(('Logistic Regression', LogisticRegression(random_state=1)))
models.append(('Decision Tree', DecisionTreeClassifier(random_state=1)))
models.append(('Random Forest', RandomForestClassifier(random_state=1)))
models.append(('Bagging', BaggingClassifier(random_state=1)))
models.append(('Adaboost', AdaBoostClassifier(random_state=1)))
models.append(('Gradient Boosting', GradientBoostingClassifier(random_state=1)))
results1 = [] # Empty list to store all model's CV scores
names = [] # Empty list to store name of the models
# loop through all models to get the mean cross validated score
print('\n' 'Cross-Validation performance on training dataset:' '\n')
for name, model in models:
kfold = StratifiedKFold(
n_splits=5, shuffle=True, random_state=1
) # Setting number of splits equal to 5
cv_result = cross_val_score(
estimator=model, X=X_train, y=y_train, scoring=scorer, cv=kfold
)
results1.append(cv_result)
names.append(name)
print('{}: {}'.format(name, cv_result.mean()))
print('\n' 'Validation Performance:' '\n')
for name, model in models:
model.fit(X_train, y_train)
scores = recall_score(y_val, model.predict(X_val))
print('{}: {}'.format(name, scores))
Cross-Validation performance on training dataset:

Logistic Regression: 0.4927566553639709
Decision Tree: 0.6982829521679532
Random Forest: 0.7235192266070268
Bagging: 0.7210807301060529
Adaboost: 0.6309140754635308
Gradient Boosting: 0.7066661857008874

Validation Performance:

Logistic Regression: 0.48201438848920863
Decision Tree: 0.7050359712230215
Random Forest: 0.7266187050359713
Bagging: 0.7302158273381295
Adaboost: 0.6762589928057554
Gradient Boosting: 0.7230215827338129
# Synthetic Minority Over Sampling Technique
sm = SMOTE(sampling_strategy=1, k_neighbors=5, random_state=1)
X_train_over, y_train_over = sm.fit_resample(X_train, y_train)
models = [] # Empty list to store all the models
# Appending models into the list
models.append(('Logistic Regression - Oversampled Data', LogisticRegression(random_state=1)))
models.append(('Decision Tree - Oversampled Data', DecisionTreeClassifier(random_state=1)))
models.append(('Random Forest - Oversampled Data', RandomForestClassifier(random_state=1)))
models.append(('Bagging - Oversampled Data', BaggingClassifier(random_state=1)))
models.append(('Adaboost - Oversampled Data', AdaBoostClassifier(random_state=1)))
models.append(('Gradient Boosting - Oversampled Data', GradientBoostingClassifier(random_state=1)))
results1 = [] # Empty list to store all model's CV scores
names = [] # Empty list to store name of the models
# loop through all models to get the mean cross validated score
print('\n' 'Cross-Validation performance on training dataset:' '\n')
for name, model in models:
kfold = StratifiedKFold(
n_splits=5, shuffle=True, random_state=1
) # Setting number of splits equal to 5
cv_result = cross_val_score(
estimator=model, X=X_train_over, y=y_train_over, scoring=scorer, cv=kfold
)
results1.append(cv_result)
names.append(name)
print('{}: {}'.format(name, cv_result.mean()))
print('\n' 'Validation Performance:' '\n')
for name, model in models:
model.fit(X_train_over, y_train_over)
scores = recall_score(y_val, model.predict(X_val))
print('{}: {}'.format(name, scores))
Cross-Validation performance on training dataset:

Logistic Regression - Oversampled Data: 0.883963699328486
Decision Tree - Oversampled Data: 0.9720494245534969
Random Forest - Oversampled Data: 0.9839075260047615
Bagging - Oversampled Data: 0.9762141471581656
Adaboost - Oversampled Data: 0.8978689011775473
Gradient Boosting - Oversampled Data: 0.9256068151319724

Validation Performance:

Logistic Regression - Oversampled Data: 0.8489208633093526
Decision Tree - Oversampled Data: 0.7769784172661871
Random Forest - Oversampled Data: 0.8489208633093526
Bagging - Oversampled Data: 0.8345323741007195
Adaboost - Oversampled Data: 0.8561151079136691
Gradient Boosting - Oversampled Data: 0.8776978417266187
# Random undersampler for under sampling the data
rus = RandomUnderSampler(random_state=1, sampling_strategy=1)
X_train_un, y_train_un = rus.fit_resample(X_train, y_train)
models = [] # Empty list to store all the models
# Appending models into the list
models.append(('Logistic Regression - Undersampled Data', LogisticRegression(random_state=1)))
models.append(('Decision Tree - Undersampled Data', DecisionTreeClassifier(random_state=1)))
models.append(('Random Forest - Undersampled Data', RandomForestClassifier(random_state=1)))
models.append(('Bagging - Undersampled Data', BaggingClassifier(random_state=1)))
models.append(('Adaboost - Undersampled Data', AdaBoostClassifier(random_state=1)))
models.append(('Gradient Boosting - Undersampled Data', GradientBoostingClassifier(random_state=1)))
results1 = [] # Empty list to store all model's CV scores
names = [] # Empty list to store name of the models
# loop through all models to get the mean cross validated score
print('\n' 'Cross-Validation performance on training dataset:' '\n')
for name, model in models:
kfold = StratifiedKFold(
n_splits=5, shuffle=True, random_state=1
) # Setting number of splits equal to 5
cv_result = cross_val_score(
estimator=model, X=X_train_un, y=y_train_un, scoring=scorer, cv=kfold
)
results1.append(cv_result)
names.append(name)
print('{}: {}'.format(name, cv_result.mean()))
print('\n' 'Validation Performance:' '\n')
for name, model in models:
model.fit(X_train_un, y_train_un)
scores = recall_score(y_val, model.predict(X_val))
print('{}: {}'.format(name, scores))
Cross-Validation performance on training dataset:

Logistic Regression - Undersampled Data: 0.8726138085275232
Decision Tree - Undersampled Data: 0.8617776495202367
Random Forest - Undersampled Data: 0.9038669648654498
Bagging - Undersampled Data: 0.8641945025611427
Adaboost - Undersampled Data: 0.8666113556020489
Gradient Boosting - Undersampled Data: 0.8978572974532861

Validation Performance:

Logistic Regression - Undersampled Data: 0.8525179856115108
Decision Tree - Undersampled Data: 0.841726618705036
Random Forest - Undersampled Data: 0.8920863309352518
Bagging - Undersampled Data: 0.8705035971223022
Adaboost - Undersampled Data: 0.8489208633093526
Gradient Boosting - Undersampled Data: 0.8884892086330936
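The per-fold scores accumulated in `results1` can also be compared at a glance rather than fold by fold; a self-contained sketch with hypothetical stand-in scores (the real arrays come from the loops above):

```python
# Stand-in per-fold recall scores for two of the models (hypothetical values,
# used only to illustrate the summary step).
import numpy as np
import pandas as pd

results1 = [np.array([0.87, 0.86, 0.88]), np.array([0.90, 0.91, 0.89])]
names = ['Logistic Regression - Undersampled Data',
         'Random Forest - Undersampled Data']

# One column per model, one row per CV fold; mean() reproduces the printed averages.
cv_summary = pd.DataFrame(dict(zip(names, results1)))
print(cv_summary.mean().sort_values(ascending=False))
```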
# defining model
Model = RandomForestClassifier(random_state=1)
# Parameter grid to pass in RandomSearchCV
param_grid = {'n_estimators': [200, 250, 300],
              'min_samples_leaf': np.arange(1, 4),
              # flatten the arange into the candidate list; a nested array here
              # would be passed whole as a single (invalid) max_features value
              'max_features': list(np.arange(0.3, 0.6, 0.1)) + ['sqrt'],
              'max_samples': np.arange(0.4, 0.7, 0.1)}
#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=Model, param_distributions=param_grid, n_iter=10, n_jobs = -1, scoring=scorer, cv=5, random_state=1)
#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train,y_train)
#creating model with the best combination found
tuned_random_forest = randomized_cv.best_estimator_
#applying the combination of parameters on X_train and y_train
tuned_random_forest.fit(X_train, y_train)
RandomForestClassifier(max_samples=0.6, n_estimators=250, random_state=1)
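After fitting, `RandomizedSearchCV` also exposes the winning configuration and its mean CV score directly; a small self-contained sketch on synthetic data (the real search above runs on `X_train`/`y_train` with the full grid):

```python
# Demonstration on synthetic data with a deliberately tiny grid.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X_demo, y_demo = make_classification(n_samples=200, random_state=1)
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=1),
    param_distributions={'n_estimators': [10, 20]},
    n_iter=2, cv=3, random_state=1,
)
search.fit(X_demo, y_demo)
print(search.best_params_)  # the sampled combination with the best mean CV score
print(search.best_score_)   # that combination's mean cross-validated score
```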
#calculating metrics on training set using predefined function/saving them to new variable
tuned_random_metrics_training = model_performance_classification_sklearn(tuned_random_forest, X_train, y_train)
print('Training Metrics')
print(tuned_random_metrics_training) #printing tuned_random_metrics_training
Training Metrics
   Accuracy    Recall  Precision        F1
0  0.994933  0.908654        1.0  0.952141
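`model_performance_classification_sklearn` is a helper defined earlier in the notebook and not shown in this excerpt; given the columns it prints, a plausible sketch of what it computes:

```python
# Assumed reconstruction of the predefined helper: predict on the given data
# and return a one-row DataFrame of the four metrics reported throughout.
import pandas as pd
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

def model_performance_classification_sklearn(model, predictors, target):
    pred = model.predict(predictors)
    return pd.DataFrame(
        {
            'Accuracy': accuracy_score(target, pred),
            'Recall': recall_score(target, pred),
            'Precision': precision_score(target, pred),
            'F1': f1_score(target, pred),
        },
        index=[0],
    )
```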
#confusion matrix
X_train_pred = tuned_random_forest.predict(X_train) #making predictions for X_train using model
matrix_train = confusion_matrix(y_train, X_train_pred) #building confusion matrix with y_train and X_train_pred
sns.heatmap(matrix_train, fmt='g', annot=True) #creating heatmap for matrix/annotation labels displayed as full numbers
plt.title('Confusion matrix - Training Set') #heatmap title
plt.xlabel('Predicted Values') #x-axis title
plt.ylabel('Actual Values') #y-axis title
plt.show(); #displaying heatmap
#calculating metrics on validation set using predefined function/saving them to new variable
tuned_random_metrics_val = model_performance_classification_sklearn(tuned_random_forest, X_val, y_val)
print('Validation Metrics')
print(tuned_random_metrics_val) #printing tuned_random_metrics_val
Validation Metrics
   Accuracy   Recall  Precision        F1
0    0.9834  0.71223   0.985075  0.826722
#confusion matrix
X_val_pred = tuned_random_forest.predict(X_val) #making predictions for X_val using model
matrix_val = confusion_matrix(y_val, X_val_pred) #building confusion matrix with y_val and X_val_pred
sns.heatmap(matrix_val, fmt='g', annot=True) #creating heatmap for matrix/annotation labels displayed as full numbers
plt.title('Confusion matrix - Validation Set') #heatmap title
plt.xlabel('Predicted Values') #x-axis title
plt.ylabel('Actual Values') #y-axis title
plt.show(); #displaying heatmap
The recall for training increased from 0.72 to 0.90, but the recall for validation decreased from 0.72 to 0.71. In addition to that, there is a large difference between the recall scores for training and validation. This model seems to overfit the training data. It does not perform as consistently on the validation data.
# defining model
Model = RandomForestClassifier(random_state=1)
# Parameter grid to pass in RandomSearchCV
param_grid = {'max_depth': np.arange(2, 20),
              'min_samples_leaf': [1, 2, 5, 7],
              'max_leaf_nodes': [5, 10, 15],
              'min_impurity_decrease': [0.0001, 0.001]}
#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=Model, param_distributions=param_grid, n_iter=10, n_jobs = -1, scoring=scorer, cv=5, random_state=1)
#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train_un,y_train_un)
#creating model with the best combination found
tuned_random_forest_under = randomized_cv.best_estimator_
#applying the combination of parameters on X_train_un and y_train_un
tuned_random_forest_under.fit(X_train_un, y_train_un)
RandomForestClassifier(max_depth=11, max_leaf_nodes=15,
                       min_impurity_decrease=0.001, random_state=1)
#calculating metrics on training set using predefined function/saving them to new variable
tuned_random_under_metrics_training = model_performance_classification_sklearn(tuned_random_forest_under, X_train, y_train)
print('Training Metrics')
print(tuned_random_under_metrics_training) #printing tuned_random_under_metrics_training
Training Metrics
   Accuracy    Recall  Precision        F1
0  0.912867  0.907452   0.380353  0.536031
#confusion matrix
X_train_pred = tuned_random_forest_under.predict(X_train) #making predictions for X_train using model
matrix_train = confusion_matrix(y_train, X_train_pred) #building confusion matrix with y_train and X_train_pred
sns.heatmap(matrix_train, fmt='g', annot=True) #creating heatmap for matrix/annotation labels displayed as full numbers
plt.title('Confusion matrix - Training Set') #heatmap title
plt.xlabel('Predicted Values') #x-axis title
plt.ylabel('Actual Values') #y-axis title
plt.show(); #displaying heatmap
#calculating metrics on validation set using predefined function/saving them to new variable
tuned_random_under_metrics_val = model_performance_classification_sklearn(tuned_random_forest_under, X_val, y_val)
print('Validation Metrics')
print(tuned_random_under_metrics_val) #printing tuned_random_under_metrics_val
Validation Metrics
   Accuracy    Recall  Precision        F1
0    0.9012  0.884892   0.347458  0.498986
#confusion matrix
X_val_pred = tuned_random_forest_under.predict(X_val) #making predictions for X_val using model
matrix_val = confusion_matrix(y_val, X_val_pred) #building confusion matrix with y_val and X_val_pred
sns.heatmap(matrix_val, fmt='g', annot=True) #creating heatmap for matrix/annotation labels displayed as full numbers
plt.title('Confusion matrix - Validation Set') #heatmap title
plt.xlabel('Predicted Values') #x-axis title
plt.ylabel('Actual Values') #y-axis title
plt.show(); #displaying heatmap
The recall for training remained the same at 0.90, but the recall for validation decreased from 0.89 to 0.88. However, that is still a decent score. The training and validation have a similar recall value, so the model is performing consistently across both datasets.
# defining model
Model = GradientBoostingClassifier(random_state=1)
# Parameter grid to pass in RandomSearchCV
param_grid = {'n_estimators': np.arange(100, 150, 25),
              'learning_rate': [0.2, 0.05, 1],
              'subsample': [0.5, 0.7],
              'max_features': [0.5, 0.7]}
#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=Model, param_distributions=param_grid, n_iter=10, n_jobs = -1, scoring=scorer, cv=5, random_state=1)
#Fitting parameters in RandomizedSearchCV on the undersampled training data
#(matching the data used to refit the best estimator below)
randomized_cv.fit(X_train_un,y_train_un)
#creating model with the best combination found
tuned_gradient_under = randomized_cv.best_estimator_
#applying the combination of parameters on X_train_un and y_train_un
tuned_gradient_under.fit(X_train_un, y_train_un)
GradientBoostingClassifier(learning_rate=1, max_features=0.5, n_estimators=125,
                           random_state=1, subsample=0.7)
#calculating metrics on training set using predefined function/saving them to new variable
tuned_gradient_under_metrics_training = model_performance_classification_sklearn(tuned_gradient_under, X_train, y_train)
print('Training Metrics')
print(tuned_gradient_under_metrics_training) #printing tuned_gradient_under_metrics_training
Training Metrics
   Accuracy    Recall  Precision        F1
0  0.887867  0.981971   0.328905  0.492762
#confusion matrix
X_train_pred = tuned_gradient_under.predict(X_train) #making predictions for X_train using model
matrix_train = confusion_matrix(y_train, X_train_pred) #building confusion matrix with y_train and X_train_pred
sns.heatmap(matrix_train, fmt='g', annot=True) #creating heatmap for matrix/annotation labels displayed as full numbers
plt.title('Confusion matrix - Training Set') #heatmap title
plt.xlabel('Predicted Values') #x-axis title
plt.ylabel('Actual Values') #y-axis title
plt.show(); #displaying heatmap
#calculating metrics on validation set using predefined function/saving them to new variable
tuned_gradient_under_metrics_val = model_performance_classification_sklearn(tuned_gradient_under, X_val, y_val)
print('Validation Metrics')
print(tuned_gradient_under_metrics_val) #printing tuned_gradient_under_metrics_val
Validation Metrics
   Accuracy    Recall  Precision        F1
0    0.8792  0.874101   0.299261  0.445872
#confusion matrix
X_val_pred = tuned_gradient_under.predict(X_val) #making predictions for X_val using model
matrix_val = confusion_matrix(y_val, X_val_pred) #building confusion matrix with y_val and X_val_pred
sns.heatmap(matrix_val, fmt='g', annot=True) #creating heatmap for matrix/annotation labels displayed as full numbers
plt.title('Confusion matrix - Validation Set') #heatmap title
plt.xlabel('Predicted Values') #x-axis title
plt.ylabel('Actual Values') #y-axis title
plt.show(); #displaying heatmap
The recall for training increased from 0.89 to 0.98, but the recall for validation decreased from 0.88 to 0.87. Although it is a decent score, there is a large difference in the training and validation recall scores. The model is not generalizing as well to the validation set.
print(tuned_random_metrics_training)
   Accuracy    Recall  Precision        F1
0  0.994933  0.908654        1.0  0.952141
print(tuned_random_metrics_val)
   Accuracy   Recall  Precision        F1
0    0.9834  0.71223   0.985075  0.826722
print(tuned_random_under_metrics_training)
   Accuracy    Recall  Precision        F1
0  0.912867  0.907452   0.380353  0.536031
print(tuned_random_under_metrics_val)
   Accuracy    Recall  Precision        F1
0    0.9012  0.884892   0.347458  0.498986
print(tuned_gradient_under_metrics_training)
   Accuracy    Recall  Precision        F1
0  0.887867  0.981971   0.328905  0.492762
print(tuned_gradient_under_metrics_val)
   Accuracy    Recall  Precision        F1
0    0.8792  0.874101   0.299261  0.445872
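Printing the six metric frames one by one makes side-by-side comparison harder than it needs to be; a sketch that collects the values reported above into a single table (row labels are my own shorthand, not names from the notebook):

```python
# Values copied from the metric frames printed above; rows labelled by model
# and split for readability.
import pandas as pd

rows = {
    'Tuned RF (train)':       [0.994933, 0.908654, 1.000000, 0.952141],
    'Tuned RF (val)':         [0.983400, 0.712230, 0.985075, 0.826722],
    'Tuned RF under (train)': [0.912867, 0.907452, 0.380353, 0.536031],
    'Tuned RF under (val)':   [0.901200, 0.884892, 0.347458, 0.498986],
    'Tuned GB under (train)': [0.887867, 0.981971, 0.328905, 0.492762],
    'Tuned GB under (val)':   [0.879200, 0.874101, 0.299261, 0.445872],
}
comparison = pd.DataFrame.from_dict(
    rows, orient='index', columns=['Accuracy', 'Recall', 'Precision', 'F1']
)
print(comparison)
```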
The Tuned Random Forest - Undersampled Data model has a decent recall score on both the training and validation sets, and the two scores are very similar, so the model should perform consistently on unseen data. It is chosen as the final model.
#confusion matrix
X_test_pred = tuned_random_forest_under.predict(X_test) #making predictions for X_test using model
matrix_test = confusion_matrix(y_test, X_test_pred) #building confusion matrix with y_test and X_test_pred
sns.heatmap(matrix_test, fmt='g', annot=True) #creating heatmap for matrix/annotation labels displayed as full numbers
plt.title('Confusion matrix - Testing Set') #heatmap title
plt.xlabel('Predicted Values') #x-axis title
plt.ylabel('Actual Values') #y-axis title
plt.show(); #displaying heatmap
#calculating metrics on testing set using predefined function/saving them to new variable
tuned_random_forest_under_metrics_test = model_performance_classification_sklearn(tuned_random_forest_under, X_test, y_test)
print('Testing Metrics')
print(tuned_random_forest_under_metrics_test) #printing tuned_random_forest_under_metrics_test
Testing Metrics
   Accuracy   Recall  Precision        F1
0    0.9128  0.85461   0.378931  0.525054
#viewing training metrics again
tuned_random_under_metrics_training
|   | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.912867 | 0.907452 | 0.380353 | 0.536031 |
#creating a pipeline called pipeline
#first step is Imputer where missing values are imputed
#second step is RFU where the final model is created (random forest on undersampled data)
pipeline = Pipeline(
    steps=[
        ('Imputer', imputer),
        ('RFU', RandomForestClassifier(max_depth=11, max_leaf_nodes=15,
                                       min_impurity_decrease=0.001)),
    ]
)
#adding all columns of df_test, except Target, to X_test
X_test = df_test.drop(['Target'], axis=1)
#adding Target column to y_test
y_test = df_test['Target']
#adding all columns of df_train, except Target, to X
X = df_train.drop(['Target'], axis=1)
#adding Target column to Y
Y = df_train['Target']
#fitting and using imputer to transform the X set
imputer = SimpleImputer(strategy='median')
X = imputer.fit_transform(X)
# Random undersampler for undersampling the data
rus = RandomUnderSampler(random_state=1, sampling_strategy=1)
X_train_un, y_train_un = rus.fit_resample(X_train, y_train)
#fitting pipeline on undersampled training data
pipeline.fit(X_train_un, y_train_un)
Pipeline(steps=[('Imputer', SimpleImputer(strategy='median')),
                ('RFU',
                 RandomForestClassifier(max_depth=11, max_leaf_nodes=15,
                                        min_impurity_decrease=0.001))])
#viewing pipeline training metrics
#calculating metrics on training set using predefined function/saving them to new variable
pipeline_training = model_performance_classification_sklearn(pipeline, X_train, y_train)
print('Training Metrics')
print(pipeline_training) #printing pipeline_training
Training Metrics
   Accuracy   Recall  Precision        F1
0  0.910933  0.90024   0.374126  0.528582
#viewing pipeline testing metrics
#calculating metrics on testing set using predefined function/saving them to new variable
pipeline_test = model_performance_classification_sklearn(pipeline, X_test, y_test)
print('Testing Metrics')
print(pipeline_test) #printing pipeline_test
Testing Metrics
   Accuracy   Recall  Precision        F1
0    0.9068  0.85461   0.361862  0.508439
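Once validated, a pipeline like the one above would typically be persisted so it can be reused without retraining; a sketch using joblib on a stand-in pipeline (the real fitted object is `pipeline` above, and the file name here is hypothetical):

```python
# Save and reload an (unfitted, stand-in) copy of the production pipeline.
import os
import tempfile

import joblib
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

demo_pipeline = Pipeline(steps=[
    ('Imputer', SimpleImputer(strategy='median')),
    ('RFU', RandomForestClassifier(max_depth=11, max_leaf_nodes=15,
                                   min_impurity_decrease=0.001)),
])

path = os.path.join(tempfile.gettempdir(), 'renewind_pipeline.joblib')
joblib.dump(demo_pipeline, path)   # serialize the pipeline to disk
restored = joblib.load(path)       # reload it later for predictions
print(type(restored).__name__)
```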